
feat(adapter/nemo): cleanup checkpoint container on_train_end#94

Open
Leahlijuan wants to merge 9 commits intomainfrom
feat/cleanup

Conversation

@Leahlijuan (Collaborator)

This change adds `on_train_end` to the `MLFlashpointCheckpointCallback`. It shuts down the replication manager once the ML-Flashpoint checkpoint save is done, and then removes the MLF checkpoint container.
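For context, a minimal sketch of what such a hook could look like. This is illustrative only, not the actual adapter code: the `container_dir` attribute and the replication manager's `shutdown()` method are assumptions based on this description.

```python
import shutil
from pathlib import Path


class MLFlashpointCheckpointCallback:
    """Illustrative sketch only. Assumes the callback holds a
    replication_manager with a shutdown() method and knows the path of
    its checkpoint container; attribute names are hypothetical."""

    def __init__(self, container_dir, replication_manager=None):
        self.container_dir = Path(container_dir)
        self.replication_manager = replication_manager

    def on_train_end(self, trainer=None, pl_module=None):
        # Shut down replication first so nothing writes into the
        # container while (or after) we delete it.
        if self.replication_manager is not None:
            self.replication_manager.shutdown()
        # Then remove the MLF checkpoint container from local storage.
        if self.container_dir.is_dir():
            shutil.rmtree(self.container_dir, ignore_errors=True)
```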

@Leahlijuan Leahlijuan requested review from g-husam and kkkapu March 30, 2026 21:14
return

for cb in mlf_callbacks:
    cb.replication_manager = replication_manager
Collaborator

Should we just initialize this earlier and pass it to the callback? Or do you think this is better? It might complicate the recipes a little, which isn't great, but it would be more straightforward.

Collaborator Author

Let's keep it for now so that we don't need to modify the recipes.

@g-husam g-husam changed the title Feat/cleanup: cleanup checkpoint container on_train_end feat(adapter/nemo): cleanup checkpoint container on_train_end Apr 2, 2026
@Leahlijuan Leahlijuan requested a review from g-husam April 6, 2026 17:18
@Leahlijuan Leahlijuan requested a review from g-husam April 7, 2026 13:48
@github-actions

github-actions bot commented Apr 7, 2026

Python Code Coverage Summary

| Package | Line Rate | Branch Rate |
| --- | --- | --- |
| src.ml_flashpoint | 100% | 100% |
| src.ml_flashpoint.adapter | 100% | 100% |
| src.ml_flashpoint.adapter.megatron | 97% | 95% |
| src.ml_flashpoint.adapter.nemo | 98% | 94% |
| src.ml_flashpoint.adapter.pytorch | 99% | 92% |
| src.ml_flashpoint.checkpoint_object_manager | 93% | 93% |
| src.ml_flashpoint.core | 95% | 92% |
| src.ml_flashpoint.replication | 81% | 81% |
| **Summary** | **95%** (2364 / 2493) | **92%** (559 / 610) |

Minimum allowed line rate is 90%

@github-actions

github-actions bot commented Apr 7, 2026

C++ Code Coverage Summary

| Package | Line Rate | Branch Rate |
| --- | --- | --- |
| src.ml_flashpoint.checkpoint_object_manager.buffer_object | 93% | 54% |
| src.ml_flashpoint.checkpoint_object_manager.object_manager | 69% | 33% |
| src.ml_flashpoint.replication.transfer_service | 79% | 40% |
| **Summary** | **81%** (924 / 1142) | **43%** (698 / 1638) |

Minimum allowed line rate is 80%

namespace fs = std::filesystem;

namespace {
// We use a fork/exec approach calling 'rm -rf' here instead of
Collaborator

Interesting. Is this the only way to get around that? Do we know why there was a segfault?

  if os.path.isdir(container_id):
      # Use shutil.rmtree for recursive deletion.
-     shutil.rmtree(container_id)
+     shutil.rmtree(container_id, onerror=_onerror)
Collaborator

A note: this `delete_container` function is synchronous and doesn't use the async C++ impl for deleting a dir, so we should be careful about when we use each. For consistency, we might want to make this call that one and allow it to block.

Collaborator Author

Since we may call `delete_container` before everything is fully finished (save + replication), the transfer service might modify the dir at the same time (on the receiver we save to a tmp file first and then rename it), so there could be a file-not-found error.
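Given that race, one way the synchronous path can tolerate files vanishing mid-delete is an `onerror` handler that swallows `FileNotFoundError` and re-raises everything else. This is a sketch; `_onerror` here is my guess at the handler referenced in the diff.

```python
import os
import shutil


def _onerror(func, path, exc_info):
    # shutil.rmtree calls this as (function, path, sys.exc_info()).
    # Tolerate entries that disappear while we are deleting (e.g. a
    # receiver's tmp file that was renamed away); re-raise anything else.
    exc = exc_info[1]
    if not isinstance(exc, FileNotFoundError):
        raise exc


def delete_container(container_id):
    # Synchronous recursive delete, as discussed above.
    if os.path.isdir(container_id):
        shutil.rmtree(container_id, onerror=_onerror)
```

Note that on Python 3.12+ `shutil.rmtree`'s `onerror` parameter is deprecated in favor of `onexc`, which receives the exception object directly instead of an `exc_info` tuple.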

int peer_port_;
size_t max_size_;
std::queue<int> available_connections_; // Guarded by mtx_.
Collaborator

Can you explain the rationale for using a queue here and an unordered set below? Specifically, why is FIFO order relevant for `available_connections_`?

Also, are these two collections mutually exclusive?

Collaborator Author

Yes, they should be mutually exclusive. As for the queue in `available_connections_`: ideally there is no difference between a queue and a stack here; it's just a way to track the idle connections and grab one quickly.

I added `active_connections_` because we want to destroy all live connections during shutdown.
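The invariant described here (each connection lives in exactly one of the two collections) can be sketched as follows. This is a simplified Python analogue of the C++ pool, with made-up method names, not the actual implementation.

```python
import threading
from collections import deque


class ConnectionPool:
    """Illustrative sketch: a connection is either in `available`
    (idle, FIFO) or in `active` (in use), never both."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.mtx = threading.Lock()          # guards both collections
        self.available = deque()             # idle connections, FIFO
        self.active = set()                  # connections currently in use

    def acquire(self, make_conn):
        with self.mtx:
            # Reuse an idle connection if one exists, else create one.
            fd = self.available.popleft() if self.available else make_conn()
            self.active.add(fd)
            return fd

    def release(self, fd, reuse=True):
        with self.mtx:
            self.active.discard(fd)
            # Return to the idle pool only while there is room.
            if reuse and len(self.available) < self.max_size:
                self.available.append(fd)

    def shutdown(self, close):
        # Destroy every live connection, active or idle.
        with self.mtx:
            for fd in list(self.active) + list(self.available):
                close(fd)
            self.active.clear()
            self.available.clear()
```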

}
if (reuse) {
if (available_connections_.size() < max_size_) {
LOG(INFO) << "ConnectionPool::ReleaseConnection: reuse connection";
Collaborator

These INFO logs here and below are debug logs; I would avoid using INFO for them so we don't pollute the logs. Maybe use VLOG(3).

We can follow this guideline:

| Level | Usage Case | Frequency |
| --- | --- | --- |
| VLOG(1) | High-level flow: major state changes or important function entries (e.g., "Pool initialized", "Connection created"). | Low |
| VLOG(2) | Detailed events: per-request or per-connection logic (e.g., "Connection X added to active set"). | Medium |
| VLOG(3) | Deep tracing: logic branches inside loops or complex conditionals. | High |
| VLOG(4)+ | Extremely noisy: byte-level data, heartbeats, or internal mutex locking/unlocking details. | Very High |
